Null hypothesis description

How often do we observe specific fold enrichment value among interactors of a viral protein

Probability distribution under the NULL hypothesis: How often specific domain fold enrichment among interactors of a specific viral protein (attribute of a pair viral_protein-human_domain) is observed as compared to any domain fold enrichment among interactors of that viral protein (attribute of a viral_protein, distribution), the latter is derived from permutation mimicking as if viral protein was binding to a different set of human proteins

Using fold enrichent as a statistic accounts for domain frequency in the background distribution. Background distribution of domain frequency in based on all human proteins with known PPI.

One of the ways to describe the problem of finding domains that are likely to mediate interaction of human proteins with viral proteins is by drawing a network which has 3 types of elements: viral proteins (V1, V2, V3), human proteins (H1-10) and human domains (D1-10).

knitr::include_graphics("./images/net_start.jpg")

First, we compute the fraction of human interacting partners of V2 that contain domain D6. Then, we compute the fold enrichment by dividing this fraction by the domain frequency among all proteins in the network.

knitr::include_graphics("./images/net_start_calc.jpg")

Next, we can calculate the frequency of the fraction of human interacting partners of V2 that contain domain D6 (for every viral protein and every human domain).

knitr::include_graphics("./images/net_start_fraq.jpg")

Next, we permute which human proteins interact with V1, V2 and V3, keeping the number of interaction per both viral and human proteins as well as number of edges (interactions) constant

knitr::include_graphics("./images/net_scram.jpg")

and compute the fraction of new random human interacting partners of V2 that contain any domain

knitr::include_graphics("./images/net_scram_fraq.jpg")

Finally, we compare each real fold enrichment value to the distribution of fold enrichment of randomly associated domains. For each human domain-viral protein value we calculate the fraction of the (random) distribution that is as high or higher than the real value. We repeat this procedure 100 000 times averaging out the fraction which gives us the probability of specific human domain - specific viral protein co-occurence if the domain and the protein were unrelated.

knitr::include_graphics("./images/net_start_fraq_plot.jpg")

knitr::include_graphics("./images/net_scram_fraq_plot.jpg")

Load network

viral_human_net_w_domains_d = fread("./processed_data_files/viral_human_net_w_domains", sep = "\t", stringsAsFactors = F)

# generate minimal information tables
viral_human_net = unique(viral_human_net_w_domains_d[,.(IDs_interactor_viral, IDs_interactor_human, IDs_interactor_viral_degree)])
datatable(viral_human_net)
domains_proteins = unique(viral_human_net_w_domains_d[,.(IDs_interactor_human, IDs_domain_human, domain_frequency)])
datatable(domains_proteins)
viral_human_net_w_domains = unique(viral_human_net_w_domains_d[,.(IDs_interactor_viral, IDs_interactor_human, IDs_domain_human, fold_enrichment)])
datatable(viral_human_net_w_domains)
## Warning in instance$preRenderHook(instance): It seems your data is too
## big for client-side DataTables. You may consider server-side processing:
## http://rstudio.github.io/DT/server.html

Calculate fold enrichment distribution of each viral protein in any domain and plot a few examples

if(!file.exists("./large_processed_data_files/viral_foldEnrichDist.tsv.gz")){
    # calculate fold enrichment distribution
    proc_time = proc.time()
    viral_foldEnrichDist = foldEnrichmentDist(net = viral_human_net, 
                                              protein_annot = domains_proteins, 
                                              N = 5000, cores = cores_to_use, seed = 3245)
    tot_time = proc.time() - proc_time
    fwrite(viral_foldEnrichDist, "./large_processed_data_files/viral_foldEnrichDist.tsv", sep = "\t")
    gzip("./large_processed_data_files/viral_foldEnrichDist.tsv", remove = T)
}
## Warning in RNGkind("L'Ecuyer-CMRG"): '.Random.seed' is not an integer
## vector but of type 'NULL', so ignored
## 
Written 48.2% of 50762145 rows in 2 secs using 4 threads. anyBufferGrown=no; maxBuffUsed=67%. Finished in 2 secs.      
Written 80.1% of 50762145 rows in 3 secs using 4 threads. anyBufferGrown=no; maxBuffUsed=67%. Finished in 0 secs.      
                                                                                                                                     
tot_time
##    user  system elapsed 
##  85.483   1.975 141.167
gunzip("./large_processed_data_files/viral_foldEnrichDist.tsv.gz", remove = F)
viral_foldEnrichDist = fread("./large_processed_data_files/viral_foldEnrichDist.tsv", sep = "\t", stringsAsFactors = F)
## 
Read 6.3% of 50762145 rows
Read 25.4% of 50762145 rows
Read 44.2% of 50762145 rows
Read 62.0% of 50762145 rows
Read 80.1% of 50762145 rows
Read 97.7% of 50762145 rows
Read 50762145 rows and 2 (of 2) columns from 1.199 GB file in 00:00:08
unlink("./large_processed_data_files/viral_foldEnrichDist.tsv")

# plot a few random cases
set.seed(1)
plotFoldEnrichmentDist(proteinID = sample(unique(viral_human_net$IDs_interactor_viral), 9), 
                       fold_enrichment_dist = viral_foldEnrichDist, 
                       data = viral_human_net_w_domains, text_lab = F)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Calculate P value for each viral protein and each human domain association

if(!file.exists("./processed_data_files/viralProtein_humanDomain_pval.tsv")){
    # calculate pvalue
    proc_time = proc.time()
    Pvals = foldEnrichmentPval(fold_enrichment_dist = viral_foldEnrichDist, 
                               data = viral_human_net_w_domains, cores = cores_to_use)
    tot_time = proc.time() - proc_time
    fwrite(Pvals, "./processed_data_files/viralProtein_humanDomain_pval.tsv", sep = "\t")
}
tot_time
##    user  system elapsed 
##  25.882  16.558 713.461
Pvals = fread("./processed_data_files/viralProtein_humanDomain_pval.tsv", sep = "\t", stringsAsFactors = F)

# plot pvalue distribution
ggplot(Pvals, aes(x = Pval)) + geom_histogram(bins = 100) + ggtitle("viral protein and human domain association \n pvalue distribution") + theme_light() + xlim(0,1)

# calculate fdr adjusted pvalue
Pvals[, Pval_fdr := p.adjust(Pval, method = "fdr")]
ggplot(Pvals, aes(x = Pval_fdr)) + geom_histogram(bins = 100) + ggtitle("viral protein and human domain association \n FDR adjusted pvalue distribution") + theme_light() + xlim(0,1)

# calculate q-value
Pvals_qvalue = Pvals[, qvalue(Pval)]
Pvals[, Qval := qvalue(Pval)$qvalues]
plot(Pvals_qvalue)

ggplot((Pvals), aes(x = Qval)) + geom_histogram(bins = 100) + ggtitle("viral protein and human domain association \n qvalue adjusted pvalue distribution") + theme_light() + xlim(0,1)

# plot fold enrichment vs pvalue
ggplot(Pvals, aes(x = fold_enrichment, y = Pval)) + geom_bin2d() + ggtitle("viral protein and human domain association \n fold enrichment vs p-value") + ylab("p-value") + xlab("fold enrichment") + scale_x_log10()

Relationships between pvalue statistic and protein/domain properties

# merge results to original data
viral_human_net_w_domains_d = viral_human_net_w_domains_d[Pvals, on = c("IDs_interactor_viral", "IDs_domain_human", "fold_enrichment")]

# function to accomodate ggplot2::geom_bin2d in GGally::ggpairs, taken from http://ggobi.github.io/ggally/#custom_functions
d2_bin <- function(data, mapping, ..., low = "#132B43", high = "#56B1F7") {
    ggplot(data = data, mapping = mapping) +
        geom_bin2d(...) +
        scale_fill_gradient(low = low, high = high) +
        scale_y_log10() + scale_x_log10()
}

log10_density = function(data, mapping, ...){
    ggplot(data = data, mapping = mapping) +
        geom_density(...) +
        scale_x_log10()
}

GGally::ggpairs(viral_human_net_w_domains_d[,.(domain_count, domain_frequency, 
                                               Taxid_interactor_viral,
                                               IDs_interactor_viral_degree, 
                                               IDs_interactor_human_degree, 
                                               domain_count_per_IDs_interactor_viral,
                                               domain_frequency_per_IDs_interactor_viral,
                                               fold_enrichment,
                                               Pval, Pval_fdr, Qval)], 
                lower = list(continuous = d2_bin), 
                diag = list(continuous = log10_density)) +
    theme_light() +
    theme(strip.text.y = element_text(angle = 0, size = 10),
          strip.text.x = element_text(angle = 90, size = 10))

hist(viral_human_net_w_domains_d[Pval_fdr < 0.35, domain_count_per_IDs_interactor_viral], main = "number of human proteins with specific domain per viral protein \n the lowest FDR corrected peak (0.35 < pval)", xlab = "how many times domain is repeated")

hist(viral_human_net_w_domains_d[Pval_fdr < 0.35, domain_count], main = "domain prevalence in the background set \n the lowest FDR corrected peak (0.35 < pval)", xlab = "domain count in background")

sessionInfo()
## R version 3.4.0 (2017-04-21)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Sierra 10.12.6
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] GGally_1.3.1        qvalue_2.8.0        DT_0.2             
##  [4] MItools_0.1.3       Biostrings_2.44.1   XVector_0.16.0     
##  [7] PSICQUIC_1.14.0     plyr_1.8.4          httr_1.2.1         
## [10] biomaRt_2.32.1      IRanges_2.10.2      S4Vectors_0.14.3   
## [13] BiocGenerics_0.22.0 ggplot2_2.2.1       R.utils_2.5.0      
## [16] R.oo_1.21.0         R.methodsS3_1.7.1   data.table_1.10.4  
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.11               lattice_0.20-35           
##  [3] Rsamtools_1.28.0           rprojroot_1.2             
##  [5] digest_0.6.12              R6_2.2.2                  
##  [7] GenomeInfoDb_1.12.2        backports_1.1.0           
##  [9] RSQLite_2.0                evaluate_0.10             
## [11] zlibbioc_1.22.0            rlang_0.1.1               
## [13] lazyeval_0.2.0             blob_1.1.0                
## [15] Matrix_1.2-10              rmarkdown_1.6             
## [17] gsubfn_0.6-6               labeling_0.3              
## [19] proto_1.0.0                splines_3.4.0             
## [21] BiocParallel_1.10.1        stringr_1.2.0             
## [23] htmlwidgets_0.8            RCurl_1.95-4.8            
## [25] bit_1.1-12                 munsell_0.4.3             
## [27] DelayedArray_0.2.7         compiler_3.4.0            
## [29] rtracklayer_1.36.3         htmltools_0.3.6           
## [31] SummarizedExperiment_1.6.3 tibble_1.3.3              
## [33] GenomeInfoDbData_0.99.0    matrixStats_0.52.2        
## [35] XML_3.98-1.9               reshape_0.8.6             
## [37] GenomicAlignments_1.12.1   bitops_1.0-6              
## [39] grid_3.4.0                 jsonlite_1.5              
## [41] gtable_0.2.0               DBI_0.7                   
## [43] magrittr_1.5               scales_0.4.1              
## [45] stringi_1.1.5              reshape2_1.4.2            
## [47] RColorBrewer_1.1-2         tools_3.4.0               
## [49] bit64_0.9-7                Biobase_2.36.2            
## [51] yaml_2.1.14                AnnotationDbi_1.38.1      
## [53] colorspace_1.3-2           GenomicRanges_1.28.3      
## [55] memoise_1.1.0              knitr_1.16